1. Probability Distributions 

2. What is a probability distribution function? 
3. Signal Parameter Estimation 

4. Linear Estimators 


Probability Distributions 


The distribution $P_X$ of a random variable $X$ is simply a probability measure
which assigns probabilities to events on the real line. The distribution $P_X$
answers questions of the form:


What is the probability that $X$ lies in some subset $F$ of the real line?


In practice we summarize $P_X$ by its Probability Mass Function, pmf (for
discrete variables only), Probability Density Function, pdf (mainly for
continuous variables), or Cumulative Distribution Function, cdf (for
either discrete or continuous variables).


Probability Mass Function (pmf) 


Suppose the discrete random variable $X$ can take a set of $M$ real values
$\{x_1, \ldots, x_M\}$; then the pmf is defined as:
Equation:


$$p_X(x_i) = \Pr[X = x_i] = P_X(\{x_i\})$$


where $\sum_{i=1}^{M} p_X(x_i) = 1$. For example, for a normal 6-sided die, $M = 6$ and
$p_X(x_i) = \frac{1}{6}$. For the total of a pair of dice being thrown, $M = 11$ and the pmf is as
shown in (a) of [link].
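
The two-dice pmf can be tabulated directly by counting outcomes. The following is a minimal sketch, assuming two fair, independent 6-sided dice; the function name `two_dice_pmf` is illustrative and not part of the original notes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def two_dice_pmf():
    """Return {total: probability} for the sum of two fair 6-sided dice."""
    # Count how many of the 36 equally likely outcomes give each total.
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {total: Fraction(n, 36) for total, n in sorted(counts.items())}

pmf = two_dice_pmf()
print(len(pmf))            # number of distinct totals M: 11
print(pmf[7])              # most likely total: 1/6
print(sum(pmf.values()))   # the probabilities sum to one: 1
```

Exact fractions are used here so that the normalization check holds without rounding error.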


[Figure panels: (a) pmf of 2-dice throws; (b) cdf of 2-dice throws; (c) pdf of 2-dice throws; (d) pdf of normal distribution, mean = 2, var = 3; (e) cdf of normal distribution, mean = 2, var = 3.]


Examples of pmfs, cdfs and pdfs: (a) to (c) for a 
discrete process, the sum of two dice; (d) and (e) 
for a continuous process with a normal or 
Gaussian distribution, whose mean = 2 and 
variance = 3. 


Cumulative Distribution Function (cdf) 


The cdf can describe discrete, continuous or mixed distributions of $X$ and is
defined as:
Equation:


$$F_X(x) = \Pr[X \le x] = P_X((-\infty, x])$$


For discrete $X$:
Equation:


$$F_X(x) = \sum_i \{\, p_X(x_i) \mid x_i \le x \,\}$$


giving step-like cdfs as in the example of (b) of [link].


Properties follow directly from the Axioms of Probability:


1. $0 \le F_X(x) \le 1$

2. $F_X(-\infty) = 0$, $F_X(\infty) = 1$

3. $F_X(x)$ is non-decreasing as $x$ increases

4. $\Pr[x_1 < X \le x_2] = F_X(x_2) - F_X(x_1)$

5. $\Pr[X > x] = 1 - F_X(x)$


Where there is no ambiguity we will often drop the subscript $X$ and refer to
the cdf as $F(x)$.
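
The discrete cdf and its properties can be checked on the two-dice example. This is a small sketch under the assumption of two fair dice; the helper name `cdf` is illustrative.

```python
from fractions import Fraction
from itertools import product

# pmf of the sum of two fair dice (the running example in the text)
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    pmf[a + b] = pmf.get(a + b, 0) + Fraction(1, 36)

def cdf(x):
    """F_X(x): sum of p_X(x_i) over all x_i <= x."""
    return sum((p for xi, p in pmf.items() if xi <= x), Fraction(0))

# Property 2, evaluated just outside and at the top of the support.
assert cdf(1) == 0 and cdf(12) == 1
# Property 3: the cdf never decreases.
assert all(cdf(x) <= cdf(x + 1) for x in range(0, 13))
# Property 4: Pr[4 < X <= 7] = F(7) - F(4)
print(cdf(7) - cdf(4))    # 5/12
# Property 5: Pr[X > 10] = 1 - F(10)
print(1 - cdf(10))        # 1/12
```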


Probability Density Function (pdf) 


The pdf of $X$ is defined as the derivative of the cdf:
Equation:


$$f_X(x) = \frac{d}{dx} F_X(x)$$


The pdf can also be interpreted in derivative form as $\delta x \to 0$:
Equation:


$$f_X(x)\,\delta x = \Pr[x < X \le x + \delta x] = F_X(x + \delta x) - F_X(x)$$


For a discrete random variable with pmf given by $p_X(x_i)$, the pdf is a sum of Dirac delta functions $\delta(\cdot)$:
Equation:


$$f_X(x) = \sum_{i=1}^{M} p_X(x_i)\,\delta(x - x_i)$$


An example of the pdf of the 2-dice discrete random process is shown in (c)
of [link]. (Strictly the delta functions should extend vertically to infinity, but
we show them only reaching the values of their areas, $p_X(x_i)$.)


The pdf and cdf of a continuous distribution (in this case the normal or 
Gaussian distribution) are shown in (d) and (e) of [link]. 


Note: The cdf is the integral of the pdf and should always go from zero to
unity for a valid probability distribution.


Properties of pdfs: 


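1. $f_X(x) \ge 0$

2. $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$

3. $F_X(x) = \int_{-\infty}^{x} f_X(\alpha)\,d\alpha$

4. $\Pr[x_1 < X \le x_2] = \int_{x_1}^{x_2} f_X(\alpha)\,d\alpha$


As for the cdf, we will often drop the subscript $X$ and refer simply to $f(x)$
when no confusion can arise.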
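
These properties can be checked numerically for the Gaussian example with mean 2 and variance 3. The sketch below is illustrative only: the midpoint-rule integrator, the integration limits and the helper names are assumptions made for the example, and the closed-form cdf uses the standard error-function expression.

```python
import math

MEAN, VAR = 2.0, 3.0   # values quoted for panels (d) and (e) of the figure

def pdf(x):
    """Gaussian pdf with the mean and variance above."""
    return math.exp(-(x - MEAN) ** 2 / (2 * VAR)) / math.sqrt(2 * math.pi * VAR)

def cdf(x):
    """Gaussian cdf in closed form via the error function."""
    return 0.5 * (1 + math.erf((x - MEAN) / math.sqrt(2 * VAR)))

def integrate(f, a, b, n=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Property 2: the total area under the pdf is (numerically) one.
total = integrate(pdf, MEAN - 40, MEAN + 40)
# Property 4: Pr[1 < X <= 4] as an integral of the pdf.
prob = integrate(pdf, 1.0, 4.0)

print(abs(total - 1.0) < 1e-6)                    # True
print(abs(prob - (cdf(4.0) - cdf(1.0))) < 1e-6)   # True
```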


What is a probability distribution function? 


A probability distribution function is a mathematical function that models
how frequently, and with what probability, the outcomes of an experiment
occur. A discrete probability distribution function associates a probability
with each possible value of a discrete random variable; it is also known as a
probability mass function (pmf). The probabilities of a continuous random
variable are modelled using continuous distribution functions, also known
as probability density functions (pdfs).


The following are particularly important forms of the probability 
distribution function. 


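• Binomial distribution function. Used to model the number of successes in
repeated independent trials that each have only two possible outcomes.

• Poisson distribution function. Used to model the number of events that
occur independently, at a constant average rate, in a fixed interval.

• Normal distribution. Used to model continuous quantities that cluster
symmetrically around a mean value.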


Example: 

The binomial distribution function is a discrete probability mass function
that models experiments with only two possible outcomes. The probability of
success is $p$ and the probability of failure is $q = 1 - p$. The pmf gives the
probability that we will observe $r$ successes and $n - r$ failures in a total
of $n$ trials: $f(r) = \binom{n}{r} p^r q^{n-r}$.


[Figure panels: "Probability Density Function" and "Cumulative Probability Distribution Function".]


Graph of the probability distribution function and the cumulative
probability distribution function (redrawn from ... using MATLAB)


Exercise: 


Problem: 


From the example above, what is the probability that in 20-trials there 
are exactly six successes? 


Solution: 


The probability that there are exactly six successes is approximately 0.04.
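
The stated answer can be reproduced with the binomial formula, Pr[r successes in n trials] = C(n, r) p^r (1 - p)^(n - r). The success probability is not given in the text of the example; the sketch below assumes p = 0.5, which is the value that reproduces the quoted result. The function name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

# p = 0.5 is an assumption; the example does not state it in the text.
print(round(binomial_pmf(6, 20, 0.5), 2))   # 0.04
```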


References: 


1. Random Variables and their Probability Density and Distribution 
Functions, 
(last accessed February 2006) 

2. NCAR Advanced Study Program http://www.asp.ucar.edu (last 
accessed February 2006) 


Co-Author: Mookho Tsilo 


Signal Parameter Estimation 


One extension of parametric estimation theory necessary for its application 
to array processing is the estimation of signal parameters. We assume that 
we observe a signal $s(l, \theta)$, whose characteristics are known save a few
parameters $\theta$, in the presence of noise. Signal parameters, such as
amplitude, time origin, and frequency if the signal is sinusoidal, must be 
determined in some way. In many cases of interest, we would find it 
difficult to justify a particular form for the unknown parameters’ a priori 
density. Because of such uncertainties, the minimum mean-squared error 
and maximum a posteriori estimators cannot be used in many cases. The 
minimum mean-squared error linear estimator does not require this density, 
but it is most fruitfully used when the unknown parameter appears in the 
problem in a linear fashion (such as signal amplitude as we shall see). 


Linear Minimum Mean-Squared Error Estimator 


The only parameter that is linearly related to a signal is the amplitude.
Consider, therefore, the problem where the observations at an array's output
are modeled as

Equation:


$$\forall l \in \{0, \ldots, L-1\}: \quad r(l) = \theta s(l) + n(l)$$


The signal waveform $s(l)$ is known and its energy normalized to be unity
($\sum_l s^2(l) = 1$). The linear estimate of the signal's amplitude is assumed to
be of the form $\hat{\theta} = \sum_l h(l) r(l)$, where $h(l)$ minimizes the mean-squared
error. To use the Orthogonality Principle expressed by this equation, an
inner product must be defined for scalars. Little choice avails itself but
multiplication as the inner product of two scalars. The Orthogonality
Principle states that the estimation error must be orthogonal to all linear
transformations defining the kind of estimator being sought.


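$$\forall h(\cdot): \quad E\left[\left(\sum_{l=0}^{L-1} h_{LIN}(l) r(l) - \theta\right) \sum_{k=0}^{L-1} h(k) r(k)\right] = 0$$


Manipulating this equation to make the universality constraint more
transparent results in


$$\forall h(\cdot): \quad \sum_{k=0}^{L-1} h(k)\, E\left[\left(\sum_{l=0}^{L-1} h_{LIN}(l) r(l) - \theta\right) r(k)\right] = 0$$


Written in this way, the expected value must be 0 for each value of $k$ to
satisfy the constraint. Thus, the quantity $h_{LIN}(\cdot)$ of the estimator of the
signal's amplitude must satisfy


$$\forall k: \quad \sum_{l=0}^{L-1} h_{LIN}(l)\, E[r(l) r(k)] = E[\theta r(k)]$$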


Assuming that the signal's amplitude has zero mean and is statistically
independent of the zero-mean noise, the expected values in this equation are
given by


$$E[r(l) r(k)] = \sigma_\theta^2 s(l) s(k) + K_n(k, l)$$
$$E[\theta r(k)] = \sigma_\theta^2 s(k)$$


where $K_n(k, l)$ is the covariance function of the noise and $\sigma_\theta^2$ is the
a priori variance of the amplitude. The equation that
must be solved for the unit-sample response $h_{LIN}(\cdot)$ of the optimal linear
MMSE estimator of signal amplitude becomes

Equation: 


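$$\forall k: \quad \sum_{l=0}^{L-1} h_{LIN}(l) K_n(k, l) = \sigma_\theta^2 s(k) \left(1 - \sum_{l=0}^{L-1} h_{LIN}(l) s(l)\right)$$


This equation is easily solved once phrased in matrix notation. Letting $\mathbf{K}_n$
denote the covariance matrix of the noise, $\mathbf{s}$ the signal vector, and $\mathbf{h}_{LIN}$ the
vector of coefficients, this equation becomes


$$\mathbf{K}_n \mathbf{h}_{LIN} = \sigma_\theta^2 \left(1 - \mathbf{s}^T \mathbf{h}_{LIN}\right) \mathbf{s}$$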


The matched filter for colored-noise problems consisted of the dot product
between the vector of observations and $\mathbf{K}_n^{-1}\mathbf{s}$ (see the detector result).
Assume that the solution to the linear estimation problem is proportional to
the detection theoretical one: $\mathbf{h}_{LIN} = c\,\mathbf{K}_n^{-1}\mathbf{s}$, where $c$ is a scalar constant.
This proposed solution satisfies the equation; the MMSE estimate of signal
amplitude corresponds to applying a matched filter to the observations with
Equation:


$$\mathbf{h}_{LIN} = \frac{\sigma_\theta^2}{1 + \sigma_\theta^2\, \mathbf{s}^T \mathbf{K}_n^{-1} \mathbf{s}}\, \mathbf{K}_n^{-1} \mathbf{s}$$


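The mean-squared estimation error of signal amplitude is given by

$$E[\epsilon^2] = \sigma_\theta^2 - E\left[\theta \sum_{l=0}^{L-1} h_{LIN}(l) r(l)\right]$$


Substituting the vector expression for $\mathbf{h}_{LIN}$ yields the result that the
mean-squared estimation error equals the proportionality constant $c$ defined
earlier.


$$E[\epsilon^2] = \frac{\sigma_\theta^2}{1 + \sigma_\theta^2\, \mathbf{s}^T \mathbf{K}_n^{-1} \mathbf{s}}$$
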
Thus, the linear filter that produces the optimal estimate of signal amplitude 
is equivalent to the matched filter used to detect the signal's presence. We 
have found this situation to occur when estimates of unknown parameters 
are needed to solve the detection problem (see Detection in the Presence of 
Uncertainties). If we had not assumed the noise to be Gaussian, however, 
this detection-theoretic result would be different, but the estimator would be 
unchanged. To repeat, this invariance occurs because the linear MMSE 
estimator requires no assumptions on the noise's amplitude characteristics. 
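
The claim that the matched-filter form solves the matrix equation can be checked numerically. The sketch below uses a hypothetical two-sample problem; the signal, the noise covariance matrix, the amplitude variance and the helper names `solve2` and `dot` are all illustrative assumptions, not values from the text.

```python
# Hypothetical two-sample problem; every number here is an illustrative assumption.
var_theta = 2.0                       # a priori variance of the amplitude
s = [0.6, 0.8]                        # unit-energy signal: 0.36 + 0.64 = 1
K = [[1.0, 0.5],
     [0.5, 2.0]]                      # colored-noise covariance matrix K_n

def solve2(A, b):
    """Solve the 2x2 linear system A x = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

Kinv_s = solve2(K, s)                                 # K_n^{-1} s
c = var_theta / (1 + var_theta * dot(s, Kinv_s))      # proportionality constant
h = [c * x for x in Kinv_s]                           # h_LIN = c K_n^{-1} s

# Both sides of K_n h_LIN = var_theta * (1 - s^T h_LIN) * s
lhs = [dot(row, h) for row in K]
rhs = [var_theta * (1 - dot(s, h)) * x for x in s]
print(all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs)))   # True

# Mean-squared error var_theta - E[theta * h^T r] = var_theta * (1 - s^T h) equals c
mse = var_theta * (1 - dot(s, h))
print(abs(mse - c) < 1e-12)                                # True
```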


Example: 


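Let the noise be white so that its covariance matrix is proportional to the
identity matrix ($\mathbf{K}_n = \sigma_n^2 \mathbf{I}$). The weighting factor in the minimum
mean-squared error linear estimator is proportional to the signal waveform.


$$h_{LIN}(l) = \frac{\sigma_\theta^2}{\sigma_n^2 + \sigma_\theta^2}\, s(l)$$
$$\hat{\theta}_{LIN} = \frac{\sigma_\theta^2}{\sigma_n^2 + \sigma_\theta^2} \sum_{l=0}^{L-1} s(l) r(l)$$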


This proportionality constant depends only on the relative variances of the 
noise and the parameter. If the noise variance can be considered to be 
much smaller than the a priori variance of the amplitude, then this constant 
does not depend on these variances and equals unity. Otherwise, the 
variances must be known. 

We find the mean-squared estimation error to be


$$E[\epsilon^2] = \frac{\sigma_\theta^2}{1 + \sigma_\theta^2 / \sigma_n^2}$$


This error is significantly reduced from its nominal value $\sigma_\theta^2$ only when
the variance of the noise is small compared with the a priori variance of the
amplitude. Otherwise, this admittedly optimum amplitude estimate
performs poorly, and we might as well have ignored the data and
"guessed" that the amplitude was zero. [Footnote: In other words, the problem is difficult in this case.]
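
The white-noise error expression can be compared against a simulation. The following Monte Carlo sketch makes several assumptions for illustration: the variances, the signal, the number of trials, and Gaussian distributions for both amplitude and noise (the linear estimator itself does not require Gaussianity).

```python
import random

random.seed(0)                        # fixed seed so the run is repeatable
var_theta, var_n = 1.0, 0.25          # illustrative variances (assumptions)
s = [0.5, 0.5, 0.5, 0.5]              # unit-energy signal, L = 4
gain = var_theta / (var_n + var_theta)

def trial():
    """Draw one amplitude and noisy observation set; return the squared error."""
    theta = random.gauss(0.0, var_theta ** 0.5)
    r = [theta * sl + random.gauss(0.0, var_n ** 0.5) for sl in s]
    estimate = gain * sum(sl * rl for sl, rl in zip(s, r))
    return (estimate - theta) ** 2

n_trials = 100_000
empirical_mse = sum(trial() for _ in range(n_trials)) / n_trials
predicted_mse = var_theta / (1 + var_theta / var_n)   # 0.2 for these values

print(abs(empirical_mse - predicted_mse) < 0.01)
```

With these values the noise variance is a quarter of the amplitude variance, so the predicted error (0.2) is well below the nominal value of 1.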


Linear Estimators 


We derived the minimum mean-squared error estimator in the previous section with no constraint on 
the form of the estimator. Depending on the problem, the computations could be a linear function of 
the observations (which is always the case in Gaussian problems) or nonlinear. Deriving this estimator 
is often difficult, which limits its application. We consider here a variation of MMSE estimation by 
constraining the estimator to be linear while minimizing the mean-squared estimation error. Such 
linear estimators may not be optimum; the conditional expected value may be nonlinear and it 
always has the smallest mean-squared error. Despite this occasional performance deficit, linear 
estimators have well-understood properties, they interact well with other signal processing algorithms
because of linearity, and they can always be derived, no matter what the problem. 


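Let the parameter estimate $\hat{\theta}(\mathbf{r})$ be expressed as $\mathcal{L}(\mathbf{r})$, where $\mathcal{L}(\cdot)$ is a linear operator:
$\mathcal{L}(a_1 \mathbf{r}_1 + a_2 \mathbf{r}_2) = a_1 \mathcal{L}(\mathbf{r}_1) + a_2 \mathcal{L}(\mathbf{r}_2)$, where $a_1$, $a_2$ are scalars. Although all estimators of this
form are obviously linear, the term linear estimator denotes that member of this family that
minimizes the mean-squared error.

Equation:


$$\hat{\theta}_{LIN}(\mathbf{r}) = \arg\min_{\mathcal{L}(\cdot)} E[\epsilon^T \epsilon]$$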


Because of the transformation's linearity, the theory of linear vector spaces can be fruitfully used to
derive the estimator and to specify its properties. One result of that theoretical framework is the
well-known Orthogonality Principle: the linear estimator is that particular linear
transformation that yields an estimation error orthogonal to all linear transformations of the data. The
orthogonality of the error to all linear transformations is termed the universality constraint. This
principle provides us not only with a formal definition of the linear estimator but also with the
mechanism to derive it. To demonstrate this intriguing result, let $\langle \cdot, \cdot \rangle$ denote the abstract inner product
between two vectors and $\| \cdot \|$ the associated norm.

Equation:


$$\| x \|^2 = \langle x, x \rangle$$


For example, if $x$ and $y$ are each matrices having only one column, [footnote] their inner
product might be defined as $\langle x, y \rangle = x^T y$. Thus, the linear estimator as defined by the Orthogonality
Principle must satisfy

Equation:


$$\forall \text{ linear transformations } \mathcal{L}(\cdot): \quad E\left[\left\langle \hat{\theta}_{LIN}(\mathbf{r}) - \theta,\ \mathcal{L}(\mathbf{r}) \right\rangle\right] = 0$$

To see that this principle produces the MMSE linear estimator, we express the mean-squared
estimation error $E[\epsilon^T \epsilon] = E[\|\epsilon\|^2]$ for any choice of linear estimator $\hat{\theta}$ as


Equation: 


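$$E[\|\hat{\theta} - \theta\|^2] = E\left[\left\| (\hat{\theta}_{LIN} - \theta) - (\hat{\theta}_{LIN} - \hat{\theta}) \right\|^2\right]$$


$$= E[\|\hat{\theta}_{LIN} - \theta\|^2] + E[\|\hat{\theta}_{LIN} - \hat{\theta}\|^2] - 2E\left[\left\langle \hat{\theta}_{LIN} - \theta,\ \hat{\theta}_{LIN} - \hat{\theta} \right\rangle\right]$$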


As $\hat{\theta}_{LIN} - \hat{\theta}$ is the difference of two linear transformations, it too is linear and is orthogonal to the
estimation error resulting from $\hat{\theta}_{LIN}$. As a result, the last term is zero and the mean-squared estimation
error is the sum of two squared norms, each of which is, of course, nonnegative. Only the second
norm varies with estimator choice; we minimize the mean-squared estimation error by choosing the
estimator $\hat{\theta}$ to be the estimator $\hat{\theta}_{LIN}$, which sets the second term to zero.

[Footnote: There is confusion as to what a vector is. "Matrices having one column" are colloquially termed
vectors, as are field quantities such as electric and magnetic fields. "Vectors" and their associated
inner products are taken to be much more general mathematical objects than these. Hence the prose in
this section is rather contorted.]


The estimation error for the minimum mean-squared linear estimator can be calculated to some degree 
without knowledge of the form of the estimator. The mean-squared estimation error is given by 
Equation: 


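$$E[\|\hat{\theta}_{LIN} - \theta\|^2] = E\left[\left\langle \hat{\theta}_{LIN} - \theta,\ \hat{\theta}_{LIN} - \theta \right\rangle\right]$$


$$= E\left[\left\langle \hat{\theta}_{LIN} - \theta,\ \hat{\theta}_{LIN} \right\rangle\right] + E\left[\left\langle \theta - \hat{\theta}_{LIN},\ \theta \right\rangle\right]$$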


The first term is zero because of the Orthogonality Principle. Rewriting the second term yields a 
general expression for the MMSE linear estimator's mean-squared error. 
Equation: 


$$E[\|\epsilon\|^2] = E[\|\theta\|^2] - E\left[\left\langle \hat{\theta}_{LIN},\ \theta \right\rangle\right]$$


This error is the difference of two terms. The first, the mean-squared value of the parameter, 
represents the largest value that the estimation error can be for any reasonable estimator. That error 
can be obtained by the estimator that ignores the data and has a value of zero. The second term 
reduces this maximum error and represents the degree to which the estimate and the parameter agree 
on the average. 


Note that the definition of the minimum mean-squared error linear estimator makes no explicit 
assumptions about the parameter estimation problem being solved. This property makes this kind of 
estimator attractive in many applications where neither the a priori density of the parameter vector nor 
the density of the observations is known precisely. Linear transformations, however, are
homogeneous: a zero-valued input yields a zero output. Thus, the linear estimator is especially
pertinent to those problems where the expected value of the parameter is zero. If the expected value is
nonzero, the linear estimator would not necessarily yield the best result (see this problem).


Example: 
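Express the first example in vector notation so that the observation vector is written as


$$\mathbf{r} = \mathbf{A}\theta + \mathbf{n}$$


where the matrix $\mathbf{A}$ has the form $\mathbf{A} = (1, \ldots, 1)^T$. The expected value of the parameter is zero. The
linear estimator has the form $\hat{\theta}_{LIN} = \mathbf{L}\mathbf{r}$, where $\mathbf{L}$ is a $1 \times L$ matrix. The Orthogonality Principle
states that the linear estimator satisfies


$$\forall\ 1 \times L \text{ matrices } \mathbf{M}: \quad E\left[(\mathbf{L}\mathbf{r} - \theta)^T \mathbf{M}\mathbf{r}\right] = 0$$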


To use the Orthogonality Principle to derive an equation implicitly specifying the linear estimator, the
"for all linear transformations" phrase must be interpreted. Usually the quantity specifying the linear
transformation must be removed from the constraining inner product by imposing a very stringent but
equivalent condition. In this example, this phrase becomes one about matrices. The elements of the
matrix $\mathbf{M}$ can be such that each element of the observation vector multiplies each element of the
estimation error. Thus, in this problem the Orthogonality Principle means that the expected value of
the matrix consisting of all pairwise products of these elements must be zero.


$$E\left[(\mathbf{L}\mathbf{r} - \theta)\,\mathbf{r}^T\right] = \mathbf{0}$$


Thus, two terms must equal each other: $E[\mathbf{L}\mathbf{r}\mathbf{r}^T] = E[\theta \mathbf{r}^T]$. The second term equals $E[\theta^2]\,\mathbf{A}^T$ as
the additive noise and the parameter are assumed to be statistically independent quantities. The
quantity $E[\mathbf{r}\mathbf{r}^T]$ in the first term is the correlation matrix of the observations, which is given by
$\mathbf{A}\mathbf{A}^T E[\theta^2] + \mathbf{K}_n$. Here, $\mathbf{K}_n$ is the noise covariance matrix, and $E[\theta^2]$ is the parameter's variance, $\sigma_\theta^2$.
The quantity $\mathbf{A}\mathbf{A}^T$ is an $L \times L$ matrix with each element equaling 1. The noise vector has
independent components; the covariance matrix thus equals $\sigma_n^2 \mathbf{I}$. The equation that $\mathbf{L}$ must satisfy is
therefore given by


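$$\begin{pmatrix} L_1 & \cdots & L_L \end{pmatrix} \begin{pmatrix} \sigma_n^2 + \sigma_\theta^2 & \sigma_\theta^2 & \cdots & \sigma_\theta^2 \\ \sigma_\theta^2 & \sigma_n^2 + \sigma_\theta^2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \sigma_\theta^2 \\ \sigma_\theta^2 & \cdots & \sigma_\theta^2 & \sigma_n^2 + \sigma_\theta^2 \end{pmatrix} = \begin{pmatrix} \sigma_\theta^2 & \cdots & \sigma_\theta^2 \end{pmatrix}$$

The components of $\mathbf{L}$ are equal and are given by $L_i = \frac{\sigma_\theta^2}{\sigma_n^2 + L\sigma_\theta^2}$. Thus, the minimum mean-squared
error linear estimator has the form


$$\hat{\theta}_{LIN}(\mathbf{r}) = \frac{\sigma_\theta^2}{\sigma_n^2 + L\sigma_\theta^2} \sum_{l} r(l)$$
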
Note that this result equals the minimum mean-squared error estimate derived earlier under the
condition that $E[\theta] = 0$. Mean-squared error, linear estimators, and Gaussian problems are intimately
related to each other. The linear minimum mean-squared error solution to a problem is optimal if the
underlying distributions are Gaussian.
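
The closed-form components of the estimator in this example can be checked against the matrix equation they are supposed to solve. This is a minimal sketch; the number of observations and both variances are arbitrary illustrative choices, and the variable names are not from the original text.

```python
L = 5                                  # number of observations (assumption)
var_theta, var_n = 2.0, 3.0            # illustrative variances (assumptions)

# Correlation matrix of the observations: var_theta * A A^T + var_n * I,
# i.e. var_theta everywhere plus var_n added on the diagonal.
R = [[var_theta + (var_n if i == j else 0.0) for j in range(L)]
     for i in range(L)]

# Closed-form common value of the components of the 1 x L matrix.
Li = var_theta / (var_n + L * var_theta)
row = [Li] * L

# The row vector times R should reproduce (var_theta, ..., var_theta).
result = [sum(row[i] * R[i][j] for i in range(L)) for j in range(L)]
print(all(abs(x - var_theta) < 1e-12 for x in result))   # True
```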


